class: title-slide

<br>
<br>

# A Data Mining Approach for Detecting Collusion in Unproctored Online Exams<br>

.padding_left.pull-down.white[
J. Langerbein, T. Massing, .bold[_J. Klenke_], M. Striewe, M. Goedicke, C. Hanck, N. Reckmann

<br>
<br>
<br>
`\(15^{th}\)` International Conference on Educational Data Mining

Bangalore, 11-24 July, 2023
]

---

# Outline

`\(\quad\)`

1. [Introduction](#introduction)
1. [Related work](#related_work)
1. [Methodology](#methodology)
1. [Empirical Results](#empirical_results)
1. [Discussion](#discussion)
1. [References](#references)

---
name: introduction

# Introduction

* COVID-19 forced universities to switch to online classes and exams.
* Proctoring online exams with video conference software was often prohibited by data protection regulations and economically infeasible.
* In this case study, take-home exams were conducted as open-book exams, but collaboration was strictly prohibited.
* Hierarchical clustering algorithms were used to identify groups of potentially colluding students.
* The method successfully found groups with nearly identical exams.
* A proctored comparison group helped categorize student groups as "outstandingly similar".

---
name: related_work

# Related work

* Limited research exists on unproctored exams at universities prior to the pandemic.
* <a href='#bib-cleophas2021s'>Cleophas et al. (2021)</a> propose a method using event logs to detect collusion in unproctored exams.
* Previous studies focused on similarity measures for programming exams based on keyboard patterns, e.g. <a href='#bib-Hellas_2017'>Hellas et al. (2017)</a> and <a href='#bib-Leinonen_2016'>Leinonen et al. (2016)</a>.
* Other literature (e.g. <a href='#bib-hemming2010online'>Hemming (2010)</a>) relies on surveys or interviews and lacks data on actual student collusion behavior.
* Some studies suggest that unsupervised online exams may lead to collusion.
* <a href='#bib-hollister2009proctored'>Hollister et al.
(2009)</a> used GPA and final exam scores to analyze collusion, but not data collected during the exam.

---
name: methodology

# Methodology - Data set

* Data for the study were collected from the "Descriptive Statistics" course at the University of Duisburg-Essen, Germany.
* The test group took the unproctored exam at home during the COVID-19 pandemic, while the comparison group took a proctored exam in class before the pandemic.
* The exams consisted of arithmetical problems, programming tasks in R, and a short essay task.
* Event logs captured students' activities and time stamps during the exams, and points achieved per task were recorded.
* During data cleaning, students with minimal participation or achievement, as well as those with reported internet problems, were removed.
* Despite differences in exam format, both groups shared similar content and learning goals, with opportunities for questions and discussions.

---

# Methodology - Data set

`\(\quad\)`

`\(\quad\)`

| | Comparison group | Test group |
| :------: | :------: | :------: |
| Year | `\(2018/2019\)` | `\(2020/2021\)` |
| N | `\(109\)` | `\(151\)` |
| Style | proctored/in class | unproctored/at home |
| Total points | `\(60\)` | `\(60\)` |
| Sub tasks | `\(19\)` | `\(17\)` |
| Minutes | `\(60\)` | `\(60\)` |

---

# Methodology - Model

* Agglomerative (bottom-up) hierarchical clustering algorithm
* Global pairwise dissimilarities

`$$D(x_i, x_{i'}) = \frac{1}{h} \sum_{j=1}^h w_j \cdot d_j(x_{ij}, x_{i'j}) \quad \text{with} \quad \sum_{j=1}^h w_j = 1$$`

* `\(D(x_i, x_{i'})\)`: Global pairwise dissimilarity
* `\(d_j(x_{ij}, x_{i'j})\)`: Pairwise attribute dissimilarity
* `\(i = 1, ..., N\)` with `\(N = 151\)` students
* `\(j = 1, ..., h\)` attributes
* We compared two different kinds of attributes:
    * Dissimilarities in the students' event patterns (time of submission)
    * Dissimilarities in points achieved

---

# Methodology - Model

### Dissimilarities in the students' event patterns (time of submission)

* 
`\(d_j^L(v_{ij}, v_{i'j})\)` with weights `\(w_j^L\)`
* We divided the examination into `\(m = 1, ... , 70\)` intervals, since both exams took `\(70\)` min.
* `\(v_{ijm}\)` denotes the count of answers of student `\(i\)` during the `\(m\)`-th interval.
* The Manhattan metric is used to calculate the pairwise attribute dissimilarity.

`\(\quad\)`

`$$d_j^L(v_{ij}, v_{i'j}) = \sum_{m=1}^{K=70} | v_{ijm} - v_{i'jm} |$$`

---

# Methodology - Model

### Dissimilarities in points achieved

* `\(d_j^P(s_{ij}, s_{i'j})\)` with weights `\(w_j^P\)`
* `\(s_{ij}\)` denotes the points achieved by student `\(i\)` in the `\(j\)`-th sub task.
* The absolute difference is used as the dissimilarity measure.

`\(\quad\)`

`$$d_j^P(s_{ij}, s_{i'j}) = | s_{ij} - s_{i'j} |$$`

---

# Methodology - Model

### Full model

`$$D(s_i, s_{i'}, v_i, v_{i'}) = \frac{1}{h} \sum_{j=1}^h (w_j^P \cdot d_j^P (s_{ij}, s_{i'j}) + w_j^L \cdot d_j^L (v_{ij}, v_{i'j})) \quad \text{with} \quad \sum_{j=1}^h (w_j^P + w_j^L) = 1$$`

* Weights `\(w_j\)` control the influence of each attribute on the global object dissimilarity.
* We reduced the weights for:
    * R-tasks and free-text questions, since the event log might not be comparable in these cases
    * Points achieved
* Since dissimilarity measures depend on scale, the attributes were normalized.

---
name: empirical_results

# Empirical results

### Test group

.pull-left-2[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../resources/graphics/figure1.png" alt="Figure 1: Dendrogram produced by average linkage clustering of the unproctored test group (2020/21). A-F mark the clusters with the lowest dissimilarity." width="100%" />
<p class="caption">Figure 1: Dendrogram produced by average linkage clustering of the unproctored test group (2020/21).
A-F mark the clusters with the lowest dissimilarity.</p>
</div>
]

.pull-right-1[
<br>
<br>
<br>
.blockquote.font60.middle[
### Results

- Clusters **A**, **B** and **E** have a strikingly high similarity
]
]

---

# Empirical results

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../resources/graphics/figure2.png" alt="Figure 2: Comparison of the event logs and achieved points for each of the test group's (2020/21) six lowest dissimilarity clusters (A-F). Above the scatter plot, a bar chart is added to compare the points per subtask." width="85%" />
<p class="caption">Figure 2: Comparison of the event logs and achieved points for each of the test group's (2020/21) six lowest dissimilarity clusters (A-F). Above the scatter plot, a bar chart is added to compare the points per subtask.</p>
</div>

---

# Empirical results

### Comparison group

.pull-left-2[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../resources/graphics/figure3.png" alt="Figure 3: Dendrogram produced by average linkage clustering of the proctored comparison group (2018/19). G-L mark the clusters with the lowest dissimilarity." width="100%" height="60%" />
<p class="caption">Figure 3: Dendrogram produced by average linkage clustering of the proctored comparison group (2018/19). G-L mark the clusters with the lowest dissimilarity.</p>
</div>
]

.pull-right-1[
<br>
<br>
<br>
.blockquote.font60.middle[
### Results

- In comparison, no cluster seems to stand out noticeably during the proctored exam
]
]

---

# Empirical results

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../resources/graphics/figure4.png" alt="Figure 4: Comparison of the event logs and achieved points for each of the comparison group's (2018/19) six lowest dissimilarity clusters (G-L). Above the scatter plot, a bar chart is added to compare the points per subtask."
width="85%" />
<p class="caption">Figure 4: Comparison of the event logs and achieved points for each of the comparison group's (2018/19) six lowest dissimilarity clusters (G-L). Above the scatter plot, a bar chart is added to compare the points per subtask.</p>
</div>

---

# Empirical results

.pull-left[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../resources/graphics/figure5.1.png" alt="Figure 5.1: Comparison of the non-normalised distance measures." width="100%" height="60%" />
<p class="caption">Figure 5.1: Comparison of the non-normalised distance measures.</p>
</div>
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../resources/graphics/figure5.2.png" alt="Figure 5.2: Comparison of the normalised distance measures." width="100%" height="60%" />
<p class="caption">Figure 5.2: Comparison of the normalised distance measures.</p>
</div>
]

---

# Discussion

* The results of hierarchical clustering algorithms are presented in a dendrogram, which provides a visual representation of the clustering.
* A dendrogram resembles a tree structure, where objects are merged bottom-up based on their dissimilarity.
* Various hierarchical clustering algorithms exist, and the cophenetic correlation coefficient is used to assess how well each algorithm represents the original structure in the data.
* Average linkage clustering is deemed the most suitable algorithm for the analysis.
* The dendrogram shows compact clusters at medium dissimilarities, with three notable clusters (**A**, **B**, and **E**) consisting of two students each, suggesting the absence of collusion in larger groups.
* Scatter plots and bar charts are used to examine the similarity of students' chronology and achieved points within clusters.
* Comparison with the results from the comparison group supports the findings, indicating that collusion over the entire exam is unlikely and that the differences between the groups are not coincidental.

---
name: discussion

# Discussion

* The method successfully detects at least three clusters with nearly identical exams.
* The approach provides a basis for further examination of clusters based on comparison with a reference group, but the ground truth is not known, which limits the certainty of the conclusions.
* Nevertheless, the elevated risk of detection may discourage students from cheating in unproctored exams.
* This is not only an important step in adapting to the progressing digitization of education, but it also equips us better for unforeseen situations in the future, much like the COVID-19 pandemic.

---
name: references

# References

.font80[
Cleophas, C., C. Hoennige, F. Meisel, et al. (2021). "Who's Cheating? Mining Patterns of Collusion from Text and Events in Online Exams". In: _Mining Patterns of Collusion from Text and Events in Online Exams (April 12, 2021)_.

Hellas, A., J. Leinonen, and P. Ihantola (2017). _Plagiarism in Take-Home Exams: Help-Seeking, Collaboration, and Systematic Cheating_. ITiCSE '17. Bologna, Italy: Association for Computing Machinery, pp. 238-243. ISBN: 9781450347044. DOI: 10.1145/3059009.3059065. <https://doi.org/10.1145/3059009.3059065>.

Hemming, A. (2010). "Online tests and exams: lower standards or improved learning?" In: _The Law Teacher_ 44.3, pp. 283-308.

Hollister, K. K. and M. L. Berenson (2009). "Proctored versus unproctored online exams: Studying the impact of exam environment on student performance". In: _Decision Sciences Journal of Innovative Education_ 7.1, pp. 271-294.

Leinonen, J., K. Longi, A. Klami, et al. (2016). _Typing patterns and authentication in practical programming exams_, pp. 160-165.
]
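
---

# Appendix - Model sketch

The full model on the Methodology slides combines a points dissimilarity and an event-log dissimilarity per sub task and feeds the result into average linkage clustering. The Python sketch below is not the authors' implementation. It is a minimal illustration under loud assumptions: the three students, two sub tasks, three time intervals and all weights are hypothetical toy values, normalization is skipped, and the clustering is a naive loop rather than a library routine.

```python
from itertools import combinations


def manhattan(v, w):
    """Event-log dissimilarity d^L: summed absolute differences of interval counts."""
    return sum(abs(a - b) for a, b in zip(v, w))


def global_dissimilarity(i, k, points, logs, w_points, w_logs):
    """D = (1/h) * sum_j (w_j^P * d_j^P + w_j^L * d_j^L) for students i and k."""
    h = len(w_points)
    total = 0.0
    for j in range(h):
        d_points = abs(points[i][j] - points[k][j])
        d_logs = manhattan(logs[i][j], logs[k][j])
        total += w_points[j] * d_points + w_logs[j] * d_logs
    return total / h


def average_linkage(dist, n):
    """Naive agglomerative clustering; returns merges as (cluster_a, cluster_b, height)."""

    def avg(a, b):
        # Mean pairwise dissimilarity between the members of two clusters.
        return sum(dist[frozenset((i, k))] for i in a for k in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average dissimilarity.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((a, b, avg(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical toy data: 3 students, h = 2 sub tasks, 3 time intervals each.
points = [[3, 2], [3, 2], [1, 0]]        # s_ij
logs = [
    [[1, 0, 0], [0, 1, 0]],              # v_ijm for student 0
    [[1, 0, 0], [0, 1, 0]],              # student 1: same pattern as student 0
    [[0, 0, 1], [0, 0, 1]],              # student 2: answers in the last interval
]
w_points = [0.1, 0.1]                    # assumed weights, points down-weighted
w_logs = [0.4, 0.4]                      # all weights together sum to 1

dist = {
    frozenset((i, k)): global_dissimilarity(i, k, points, logs, w_points, w_logs)
    for i, k in combinations(range(len(points)), 2)
}
merges = average_linkage(dist, len(points))
for a, b, height in merges:
    print(a, b, round(height, 3))
```

In this toy setting, students 0 and 1 have identical points and event logs, so they merge first at dissimilarity zero, which mirrors how near-identical exams surface at the bottom of the dendrogram. For real data a tested library routine for hierarchical clustering would be preferable to the quadratic loop used here.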